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Given a sequence of observations from a discrete-time, finite-state hidden Markov model, we 
would like to estimate the sampling distribution of a statistic. The bootstrap method is em- 
ployed to approximate the confidence regions of a multi-dimensional parameter. We propose an 
importance sampling formula for efficient simulation in this context. Our approach consists of 
constructing a locally asymptotically normal (LAN) family of probability distributions around 
the default resampling rule and then minimizing the asymptotic variance within the LAN family. 
The solution of this minimization problem characterizes the asymptotically optimal resampling 
scheme, which is given by a tilting formula. The implementation of the tilting formula is facil- 
itated by solving a Poisson equation. A few numerical examples are given to demonstrate the 
efficiency of the proposed importance sampling scheme. 

Keywords: Locally asymptotical normal; Markov random walk; bootstrap; Poisson equation; 
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1. Introduction 

Statistical inference for hidden Markov models has recently received some attention due 
to its importance in applications to speech recognition (Rabiner and Juang [24]), signal 
processing (Elliott et al. [10]), ion channel studies (Ball and Rice [3]) and molecular bi- 
ology (Krogh et al. [16]). Good summaries on the subject are given by MacDonald and 
Zucchini [20], Kunsch [17] and Cappe et al. [7]. Likelihood-based inference for hidden 
Markov models was first considered by Baum and Petrie [4]. Leroux [19] proved consis- 
tency of the maximum likelihood estimator (MLE) for hidden Markov chains under mild 
conditions. Asymptotic normality of the MLE was established by Bickcl et al. [5]. 

Although asymptotic normality can be used to construct confidence regions for the pa- 
rameter of interest, the lack of accuracy in the asymptotic approximation to the sampling 
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distribution as well as the computational difficulty of the asymptotic variance-covariance 
matrix make it less suitable for applications. Therefore, the bootstrap method becomes 
a useful alternative. The application of the bootstrap method to hidden Markov mod- 
els was studied by Albert [1] and Staffer and Wall [25]. As the bootstrap estimate is 
obtained by Monte Carlo estimation, we need to find efficient ways to do simulation. 
This is particularly important for hidden Markov models, where high accuracy is often 
required, the estimate needs to be recomputed many times and each time a substantial 
amount of computation is required. For instance, when the EM (Baum- Welch) algorithm 
is employed to approximate the MLE, it is computed a number of times so that the error 
in bootstrap estimates can be assessed. 

Johns [15] and Davison [8] suggested using importance sampling to construct bootstrap 
confidence intervals and showed that it has potential for dramatic improvement over uni- 
form resampling. Later, Do and Hall [9] complemented it with comprehensive derivation 
and an empirical version. However, their method encounters difficulty in multi-parameter 
cases. Fuh and Hu [12] overcame the difficulty and provided an optimal tilting formula 
for the multi-parameter case. This helps the study of importance sampling in hidden 
Markov models, where the parameter space is usually multi-dimensional. 

The remaining challenge is to deal with Markovian dependence. To begin, we need to 
determine a family of tilted distributions that contains the optimal resampling distribu- 
tion, so that the optimization problem is non-trivial and solvable. Our first contribution 
is the construction a locally asymptotically normal (LAN) family of probability distribu- 
tions around the default resampling rule. It turns out that this LAN family of distribu- 
tions is closely related to the twisting formula for Markov random walks; see (A. 13) in 
Appendix A. 3. Then we minimize the asymptotic variance of the Monte Carlo estimator. 
Our second contribution is to provide a tilting formula for efficient importance sampling 
in a hidden Markov model. We also present a Poisson equation which is required to 
characterize the optimal tilting formula and to facilitate its implementation. 

The rest of this paper is organized as follows. In Section 2 we consider a naive para- 
metric bootstrap algorithm for hidden Markov models and importance sampling in this 
context. In Section 3 we propose a tilting formula for efficient importance sampling 
in hidden Markov models. The implementation of the formula requires a streamlined 
computation procedure for the variance of the associated Markov random walk. This is 
developed in Section 4. Numerical results are reported in Section 5. The technical details 
are deferred to the Appendix. 

2. Bootstrapping hidden Markov models 
2.1. A naive bootstrap algorithm 

In this section we formulate the hidden Markov model as a Markov random walk 
with the underlying Markov chain as missing data. Specifically, let {X t ,t > 0} be a 
Markov chain on a finite state space D = {1,2, . . . ,d}, with transition probability matrix 
P = \Pij]i,j=i,...,d> an d stationary distribution ir = (m, . . . ,7i"d). Suppose that an additive 
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component S m = YlT=i w ith Yq = 0, taking values in R^, is adjoined to the chain such 
that {{X t , S t ),t>0} is a Markov chain on D x R e and 

P{(X u S t )eAx(B + y)\(X t ^,S t -i) = (i,y)} 
= P{(X t ,S t ) e A x B\(X t -i,S t -i) = (i,0)} 

= p(t,AxB) = Yl I PiMfMOMdy), (2.1) 

jeA J v^ B 

where fj{-;9) is the conditional probability density function of Y t given X t =j, with 
respect to a a-finite measure y on l'. Here 9 EM. K denotes the unknown parameter in 
both the transition matrix [pij] and the conditional density fj of the hidden Markov 
model. Note that {X t ,t > 0} is a Markov chain and, given Xq, X\,..., X m , the random 
variables Y\, . . . , Y m are independent with density functions fx t (•; 0), t=l,...,m. 

Definition 1 . If there is an unobservable Markov chain {X t , t > 0} such that the process 
{(X t , St),t> 1} satisfies (2.1), then we refer to {S t ,t> 1} as a hidden Markov model. 

The likelihood of a sample y = {yi , . . . , y m } from the hidden Markov model {St, t > 1} 

is 

d d m 

E •■■ E ^I[p* t -uxJx t (yf,o), (2.2) 

2o = l Xjn — 1 t—1 

where the initial distribution is the stationary distribution it. Let 9 be the MLE of 0, and 
V be an estimate of the asymptotic variance-covariance matrix of 0. We will discuss how 
to obtain V in (3.1). Under the regularity conditions given by Bickcl et al. ([5], pages 
1617-1618), the MLE 9 is asymptotically normal. We assume these conditions hold, and 
henceforth refer to them together as Condition R. Let P B be as in (2.1) such that 9 
equals the MLE 9 based on the observed data y. A bootstrap algorithm for estimating 
the sampling distribution of the standardized statistic T(m) := m 1 / 2 U _1 / 2 (0 — 9) is as 
follows: 

1. From P , generate a Markov chain realization of n steps (xq,x*, . . . ,x^)- 

2. For each xjf, obtain an observation yj* by a random draw from f x *(-;9). 

3. Compute the MLE 9* of the bootstrap sample 3^* = (y± ,Un) an d the correspond- 
ing asymptotic variance-covariance matrix V* . 

4. Approximate the sampling distribution of T(m) by the bootstrap distribution 

T*(n) = ^rl(V*)- 1/2 (9* -9). (2.3) 

In this algorithm, we use m to denote the original sample size and n to denote the 
bootstrap sample size. Wc follow this notation in our discussion of efficient resampling 
schemes. 
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2.2. Importance sampling for bootstrap estimates 

Suppose that we would like to estimate the probability of the event {T(m) £ A} for 
A C K K . Then the bootstrap estimate of P{T(m) e A} is u = P{T*(n) e A\y}. Consider 
an importance sampling problem in hidden Markov models. Instead of resampling from 
P e directly, as in the naive bootstrap algorithm described in Section 2.1, we resample 
from an alternative distribution Q. To be more precise, given y, let y\, . . . ,y B denote 
independent samples drawn according to the bootstrap algorithm under the probability 
distribution Q for the hidden Markov model {St, t > 1}. For b = 1, . . . , B, write T b as the 
version of T computed from . Then the importance sampling bootstrap approximation 
of u is 



When Q = P e , (2.4) is the approximation under the naive parametric bootstrap algo- 
rithm, the default resampling rule of this paper. To make the relationship between the 
default resampling rule and the importance sampling rule more transparent, we adopt 
the following notation. We denote the default resampling rule by replacing every occur- 
rence of the superscript f with *, with the understanding that the default resampling 
rule is the naive parametric bootstrap algorithm. That is, S 1 * = X) t=1 Y t * is a hidden 
Markov model according to Definition 1 under the probability P e . The P e probability is 
a conditional probability which depends on the sample y through 9. Because we always 
indicate random variables from P with '*', there is no danger of confusion. Henceforth, 
we drop the dependence on y for convenience. 

It is easy to see that u B is an unbiased estimate of u. It was shown in Hall [14] that 



Because u B is unbiased, the mean squared error of u B equals its variance. Note that u 
does not depend on Q. To minimize the variance (2.5) of u B , it is sufficient to minimize 
v by properly choosing Q from a suitable class of probability distributions. 

3. An exponential tilting formula 

3.1. An optimization problem in a LAN family 

Under Condition R, the MLE 9 is a smooth function of the sample mean; see Ghosh 
([13], Section 2.6). That is, there exists a smooth function g from Mr i— > K K such that 
9 = g(S m /m). Suppose that we would like to estimate the sampling distribution of the 
MLE 9 = (9 U 9 2 ,...,9 K ) T = (g 1 (S m /m),g 2 (S m /m),... 1 g K (S m /m)) T , where T denotes 




(2.4) 



v&y(u b ) = B 1 {v-u 2 ), where i = E l {r(n)6/1} — (/) 



(2.5) 



transpose. 
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Let £ be the I x £ variance-covariance matrix of S m = (Si m , . . . , Sg m ) T . Then we can 
estimate it by S = Eg , the variance-covariance matrix of S m under probability P e , which 



be the stationary mean of Y t * under P . Let J be the Jacobian matrix of g, and denote 
by J(i the Jacobian matrix of g at fj,. Let 



be the estimated variance-covariance matrix of 9 = g(S m /m). Note that estimating the 
conditional probability of the event {T* (n) G A} is asymptotically equivalent to estimat- 
ing 



We now study the problem of how to choose Q such that the variance of u B is mini- 
mized. From (2.5), this is equivalent to choosing Q such that 



In order to pose (3.3) as a well-defined minimization problem, we need to determine 
an appropriate class of Q probability distributions so that meaningful optimization can 
take place. It turns out that significant optimization can occur within a LAN family of 
probability distributions; see LeCam and Yang [18] for the definition of LAN. That is, 
we shall consider the family C of probability distributions, which are LAN at P . 

Note that in a LAN family, the magnitude of the asymptotic mean for the log-likelihood 
ratio equals half of its asymptotic variance. This is the key property that we need to solve 
the minimization problem (3.3). In Appendix A. 2, we construct a LAN family and show 
that it is closely related to the celebrated twisting formula for Markov random walks, 
studied by Miller [21] and Ney and Nummclin [23]. 

When the underlying Markov chain moves from state i to state j, we use and hj(y) 
to denote respectively the transition probability and the conditional probability density 
of an observation y under Q. Let Pij{9) and fj(y;9) be the transition probability and 
conditional density under P e . Note that for the rest of this section, n — > oo means the 
bootstrap sample size tends to infinity while the original sample size m remains fixed. 
We now define the class C of probability distributions as those satisfying the following 
conditions: 

(CI) The optimal tilting distribution is given by 




V = J A E JJ 



(3.1) 




(3.2) 




(3.3) 



Hjhj(y) =p ll {9)f J (y;9) exp 



(3.4) 
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(C2) Let Z be a normal random variable with mean zero and variance a 2 . Then as 
n —> oo, the log- likelihood ratio 



K = Jog 



dP e 

dQ 



7^ >Z+ r 



in distribution (3.5) 



for observations , Y*), t = l,...,n, from P e . Moreover, L* and n~ 1 / 2 (5* — 
nfi) are asymptotically jointly normal. 
(C3) The o(l) terms in (3.4) tend to as n — > co and are asymptotically negligible in 
determining the limiting distribution of (3.5). 

For importance sampling from Q in class C, it follows from (3.3) and (3.4) that we 
need to minimize, over Q, 



E<^ 1 



dP 6 



Yr t ^c xt _ iXt {Y*) + o{l) 



(3.6) 



L ^ T *^ eA >dg 

Since T*(n) is asymptotically normal and Q € C, it follows that as n — ► co, (3.6) tends to 



E[l {NgA} exp(7V L )], 



(3.7) 



where N = (N±, . . . ,N K ) T and Nl are jointly normal. The distribution of N is K-variate 
normal with zero mean and identity variance-covariance matrix, while the distribution 
of Nl is normal with mean \il and variance a\ . By (3.5), we have fiL = 

Let pk be the asymptotic correlation between the fcth component, y/n(9^ — 9k), of 
T*(n) and the log-likelihood ratio L* n for k — 1, . . . , n. Let = (<tlPi,o~lP2, ■ ■ ■ , o~lPk) T 
denote the covariance between N and L* . Then we can write the joint variance-covariance 
matrix of (N, Nl) t as 



/ 1 






P\0~L 









1 



P2&L 



PkVL 
-I 



where I K is the n x n identity matrix. Thus the optimization problem (3.3) is reduced to 
that of finding o~l and Sl so that (3.7) is minimized. 



3.2. The derivation of the optimal tilting formula 

The following lemma determines the minimum of (3.7). The proof of Lemma 1 can be 
found in Fuh and Hu [12]. 
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Lemma 1. The following choice of Sl and o~l minimizes (3.7): 



£ L = -±E(N|Ne^-£ L ), tr L = ^E££ L . (3.8) 

We now proceed to identify Oij{y) in (3.4) for the optimal Q such that (3.8) is satisfied. 
Observe that lim^^ cov(£jT* (n) , L* n ) = lim^^ cov(T*(n), L* n ) = £j£ L = of. On 
the other hand, from the Cauchy-Schwarz inequality it follows that 



lim cov(ZlT*(n),L* n )< lim Jw(E£T*(n))w(L*) = of. 

n — >oo n — >oo V 

Because the equality is attained only when L* is asymptotically equivalent to a linear 
function of £jT*(n) and thus asymptotically equivalent to a linear function of S n , we 
have 

n 

"- 1/2 Ek^,' P7) + "(1)] « c"~ 1/2 (^ - "A) (3-9) 
t=i 

for some constant cei'. 

Let riy be the number of i-to-j transitions and rii be the number of visits to state 
i by XI, . . . ,X*. We first represent rii/y/n in terms of riij/y/n. Let 7* be any constant 
independent of n; then n^n -1 / 2 « ?i -1 / 2 [7$ Y2j=i n ij + (1 — 7») Sj=i n ji] > as n ~~ * 00 • This 

is possible because 2j=i n »j — 53j=i n j» = — 1»(-^t»)j where 1»(") denotes the indi- 

cator function of state i. Later, we will specify the value of 7, so that other conditions are 
satisfied. Let us first assume that I = 1 - that is, Y t * , t = 1, 2, . . . , n, are one-dimensional 
- and then show that the generalization to the multi-dimensional case is straightforward. 
Let Wi = J2teD Yt*/ n ii i = 1, ■ ■ ■ ,d, where Dj = {t\X£ = i, 1 < t < n}. 
For i = l,...,d, summing Wi — fi with respect to i from 1 to d, we obtain 



n*. d d r d 



c- 



/- =c)]K-m)^=«cV (tBi - A)7, V + (m ; - /2)(1 - 7i) V 

V 2 — 1 V 2—1 L J — 1 V J — 1 V 



c 



i=1 i,j=l,i^j V 



(3.10) 



where Si = — (w; — fi)ji. Let Aj = {*|-^t-i = h %t = ji 2 < £ < 71}. By (3.9), match the 
coefficient of (3.10) with Ylt j=i Cij n ij / V™, where = "YliteD c ij^Xt )l n ij^ to obtain 



lim Cij = lim c(u)j — fx — <5j + <5 7 ). (3-H) 

n— ►oc n— >oc 

Note that lim„_» 00 cY ) - = / c ij (y)f : j(y,9)v(Ay). In view of 

CiM + o(iy 



3=1 



exp 



,1/2 
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I [i-^(y)n- 1 / 2 + o(n- 1 / 2 )]/,( 2/ ,% 4J (0>(d 2/ ), 

3=1 J 



we have 



lim y*CijPij0) = 0. 



(3.12) 



j'=i 



From (3.11) and (3.12) 


we conclude that <5j, 


i = l,.. 


.,d, 


satisfy 










d 






d 


lim ^ (u)j — /i - <5 ? : + 6j )pij 0) = 


=>■ lim 

n — >oc 




- fi)Pa 0) - 8 




3=1 






3=1 






3=1 


In matrix form, this becomes 














"Pll P12 ■ 


■ Pld~ 


-(wi - 


A)~ 






lim 


P21 P22 ■ 


■ P2d 


(w 2 - 


A) 






n — >co 


-Pdl Pd2 ■ 


■ Pdd- 


-{w d - 


A)- 







1 -Pll 



"Pl2 



-P21 1 - £>22 



"Pdl 



-Pd2 



-Pld 
-P2d 

1 -Pdd 



= 0, 



where we have dropped 9 from (0) for simplicity. Clearly, linin^oo lUj = E(V ( * | * = j) = [ij . 
Thus we can replace Wj by fij in the preceding matrix equation. 

It is easy to see that if Y t *,t = 1, . . . ,n, are multi-dimensional, the preceding matrix 
equality holds for each component of the random vectors Y t * . Denote by / as the identity 
matrix, and let I\ = E(Y^* — fi\X^ = i) be the adjusted conditional mean given X£ = i, 
for i = l,...,d. Write r = (I\) and A = (Si) = (Su), adxf matrix. Then the preceding 
matrix equation implies that A is a solution of the Poisson equation 



(I-P)A = PT. 
Let Si be the solution of (3.13). Consider choosing 

c ij (y) = 2Z , lV- 1 / 2 J il (y- jl + Si-Sj), 



(3.13) 
(3.14) 



where El is defined in (3.8). It can be shown that if we choose Cij(y) according to (3.14), 
then (3.8) is satisfied. 

The optimal tilting distribution is given by 



q ij h j (y)= % 



pjj 0)fj 0/> 0) ex p[- c »j [y)l \M 

cx p ( - c -ij ( v ) I V™ ) /j ■( v ) v ( ) 



(3.15) 
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where Cij(y) given by (3.14). Furthermore, let 

Cij = J exp[-c ij (y)/y/n\fj(y,e)v(dy). 

Then, in (3.15), we have 

qij = ( 3.i6) 

and 

hM =C^exp(-c ij (y)/V^)f j (yA (3-17) 

Note that due to the cancelation of Si and Sj in Cij{y) and Cy , hj(y) defined in (3.17) 
depends on the current state j only. We summarize our findings in the following theorem. 



Theorem 1. Let 9 be the MLE of the sample y = {yi, y 2 , ■ ■ ■ , y m } from a hidden Markov 
model (2.1) satisfying Condition R. To estimate the sampling distribution of 9, we do 
importance sampling according to the following procedure. 

(i) Sample from a Markov chain with transition matrix (3.16) to obtain {ccj, x\, . . . 

(ii) For each x\, i = 1, . . .,n, sample from h x t (-) of (3.17) to obtain yj. 

(iii) Calculate the MLE 9^ of the sample {y\ 7 . . . ,2/J,}. 

(iv) Repeat the preceding steps B times to obtain an approximation of (3.2) via (2.4). 

The preceding importance sampling scheme minimizes the asymptotic variance of (2.4) 
among all distributions within the class C defined in (3.4). 

Proof. We have shown that in order for the importance sampling estimator (2.4) to have 
minimum asymptotic variance in class C, it is necessary for the importance sampling 
distribution to satisfy (3.14)-(3.17). It remains to show that the resampling distribution 
given by (3.14)-(3.17) actually belongs to C. The details are given in Appendix A.l. □ 

4. Implementation of the tilting formula 

To implement the tilting formula (3.14) (3.17), wc need to know how to compute and 
V . Let us first consider the computation of El. Since El is only implicitly defined by 
(3.8), it cannot be evaluated directly. However, it can be employed to construct a recursive 
approximation algorithm. In this regard, it is easier to approximate —El- Changing El 
to El = —El in (3.8), we obtain 

EL = iE(N|NeA + E L ). (4.1) 

From (4.1), we can compute El via a recursive algorithm as follows: 
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(i) Initialize E L = E^ 0) . 

(ii) Iterate E^ +1) = |i?[N|N 6A + S^]. 

The convergence proof and some useful results on the implementation of the recursive 
algorithm can be found in Fuh and Hu ([12], Section 2). 

The computation of V, or, for that matter, the computation of E, the asymptotical 
variance for the Markov random walk S n , is much more complicated than that for the 
independent and identically distributed case where the sample covariance matrix does 
the job. Here we develop a representation which allows straightforward calculation of E 
via the solution of a Poisson equation. 

Let {X n ,n>0} be a finite ergodic (positive recurrent, aperiodic and irreducible) 
Markov chain on state space D = {l,...,d} with stationary distribution n. Let 
{(X n , S n ),n > 0} be the Markov random walk defined in (2.1). Then 

oo 

E = E n [(y x - /x) (Fx - ^) T )] + 2 E w m - M) (Y k+ i - M) T ] (4.2) 

fc=i 

is well defined if E w (||Yi|| 2 ) < oo. Furthermore, if E is positive definite, Theorem 17.0.1 
of Meyn and Tweedie [22] shows that 

—^(S n - no) -> N(0, E) in distribution. (4.3) 

Note that (4.2) is inconvenient to compute. We provide another representation of the 
asymptotic variance, E = [of;/];, i'=i, which facilitates the computation of it, 

d d 

a? v =Y,iG ll ,(i)-r l (i)T l ,(i)]7r i + ^ [T^-du+S^fpijWi, (4.4) 

where r,(i) - E(F i; - W |X = i), - E[(r i; - - W') T |^o = t], W is the Kb. 

component of the stationary mean, and 5u, i = 1, . . . , d, I = 1, . . . ,£, are the elements of 
the d X £ matrix A which is the solution of the Poisson equation 

(I-P)A = PT l (i), (4.5) 

in which / denotes the identity operator. The asymptotic variance formulae (4.4) and 
(4.5) for Markov random walks in general state space and their proofs are given in 
Appendix A. 2. 

5. Simulation study 

To demonstrate the effectiveness of our method, we study two examples. We measure the 
effectiveness by relative efficiency, which is defined to be the ratio of the variance under 
the default resampling distribution P to that under the tilted probability distribution Q 
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given by (3.14)-(3.17). We refer the reader to Fuh and Hu ([12], Section 5.1) for results 
on relative efficiency. 

To construct confidence regions through importance sampling, it is usually necessary 
to combine stratified sampling with importance sampling. That is, we need to partition 
the region into several parts and apply a different tilting formula to each part. These 
results are given in Fuh and Hu ([12], Section 4) and are used in Sections 5.1 and 5.2. 

The first example concerns a three-state hidden Markov chain {X tl t > 0} with tran- 
sition matrix P, and the {Y t ,t > 0} observations follow a bivariate normal distribution 
N((/J,ji, /j,j2) T given the states j = 1,2,3 of the Markov chain. Specifically, let 



P = 



0.2 
0.3 
0.5 



0.3 
0.4 
0.3 



0.5 
0.3 
0.2 



1.0 
0.3 



0.3 
1.0 



(5.1) 



/in = M12 = 0, H21 = M22 = 5 and fi^i = A*32 = 10- The second example concerns the time 
series of daily counts of epileptic seizures is given in Section 5.2. 



5.1. A bivariate normal example 

We first generate m = 100 observations Y\ = (Yu 7 Yi 2 ) T , . . . ,Y m = (Y m i, Y m2 ) T from the 
hidden Markov model defined in (5.1). The parameter of interest is the stationary mean 
(X^i=i 7i* J2i=i 7r i/ i i2) T - The estimator is the sample mean (/ii,/i2) T . Two different 
types of confidence regions, square and circular, are considered in this simulation study. 
The bootstrap sample size is n = 100, and the number of bootstrap replications is B = 
1000 for uniform resampling and B = 200, 100, 50 for importance sampling. The whole 
experiment was repeated 180 000 times to estimate the coverage probability and the mean 
and the standard deviation of confidence region areas. The nominal coverage probability 
is 0.95 in all cases. The results are summarized in Tables 1 and 2. The tilting points and 
the confidence regions are shown in Figures 1 and 2 for various nominal levels. 

In Table 1, we divide the complement of a square region into four subregions. The 
four optimal tilting points chosen according to (3.14) and (3.15) are (0,r), (r,0), (0, — r), 
(— r, 0), as shown in Figure 1(a), where r = 2.4613 by inverting the normal approximation. 
Note that the large-deviations tilting would use r = 2.236; see Fuh and Hu [12]. The four 
transition matrices and the conditional densities ftjf(-) ~ A r ((/i^ 1 ,/x^ 2 ) T , £) of Y given the 
hidden states j = 1,2,3 and optimal tilting points k = 1,.. .,4 can be calculated from 
(3.14)-(3.17). 

In Table 2, we divide the complement of a circular region into four subregions. The four 
optimal tilting points chosen according to (3.8) are (0, r), (r, 0), (0, — r), (— r, 0), as shown 
in Figure 2(a), where r = 2.655, whereas the large-deviations tilting would use r = 2.447. 
Similar to the square confidence region, the transition matrices and conditional densities 
can be calculated from (3.14)-(3.17). 

Tables 1 and 2 reveal that the importance sampling method permits a reduction of 
replication sizes from 5 to 1. The performance is still reasonable for a reduction from 10 
to 1. The only penalty seems to be a slight increase in the variability of the confidence 
region area. 
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5.2. A Poisson example 

Albert [1] described the fitting of a two-state Poisson hidden Markov model to the 
sequence of daily seizure counts recorded during follow-up for each of 13 outpatients 
with intractable epilepsy maintained on steady anticonvulsant drugs. Specifically, let 
X = (Xq,Xi, . . . ,X m ) be generated from a two-state (0 and 1) Markov chain with un- 
known transition probabilities po± and piq. Write p\\ = 1 —pio and poo = 1 — Poi- Given 
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X, let Y\, . . . , Y rn be the observed counts from the Poisson distributions 

e -A» \Vk 

P(Y k =y k \X k = i) = - -i-, i = 0,l, 

yfc! 

where Ao and Ai are the mean numbers of counts in states and 1, respectively. Let 
9 = (poi;PiO) Ao, Ai) be the parameter of interest. Balish et al. [2] demonstrated using 
quasi-likclihood regression models that all but one patient had seizure counts fitted in- 
adequately by a Poisson distribution. As reported in Albert [1], the two-state hidden 
Markov model provides a better fit and described the apparent clustering of seizures 
better than a Poisson regression model with autoregressive terms. 

To illustrate the efficiency of the proposed method in Theorem 1, we adopted estimates 
in Albert [1] of transition probabilities poi = 0.197, pxo = 0.61, and the Poisson means 
Ao = 0.251, Ai = 2.0 for a particular patient. Bootstraps are done by generating a random 
sample of size 100 with the aforementioned parameter values for the patient concerned. 
We then compute the MLE using the EM algorithm and generate bootstrap samples via 
the naive bootstrap algorithm in Section 2.1 and via importance sampling according to 
Theorem 1. The number of bootstrap replications is B = 1000 for uniform resampling, 
and B = 200, 100, 52 are used for importance sampling. As in Section 5.1, we obtain 
four tilting points and r = 2.4613. Table 3 shows that importance sampling permits a 
reduction of bootstrap replication sizes from 5 to 1. Figure 3 exhibits three confidence 
regions for (Ao, Ai) from importance sampling with n = 100 and B = 200. 




Figure 1. (a) The four tilting points and (b) 0.5, 0.95 and 0.99 bootstrap confidence regions 
for parameter estimates of a three-state model using importance sampling with n = 100 and 
replication size B = 200. 
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(a) (b) 

Figure 2. (a) The four tilting points and (b) the 0.5, 0.95 and 0.99 bootstrap circular confidence 
regions for parameter estimates of a three-state hidden Markov model using importance sampling 
with n = 100 and replication size B = 200. 



Table 3. Square confidence regions for the two-state model with parame- 
ters A = 0.251, Ai = 2.0, poi = 0.197, p w = 0.61 and 10 000 Monte Carlo 
repetitions 
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Appendix 

A.l. Proof of Theorem 1 

We drop '*' from y* and x% for simplicity. It is understood that in the following proof 
yt and xt are generated according to P , that is, the default sampling rule. From (3.15), 
the log-likclihood ratio is given by 
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n d „ 

io snx^ t -ij / cxp 



d ,. 



cxt-iAvt) 



EN 



cxp 



„l/2 

771/2 



fj(yt,0)v(dy)eTq> 



Cx t -i,x t (yt) 



fjfjIueMdy) 



n l/2 

c-x t -i,x t (yt) 



••(A.l) 



Let Ej denote the conditional expectation given xt = j. The integral in (A.l) with a 
two-term Taylor expansion of the exponential term equals 

1 E 3 [c xt _ u3 (y t )} y 3 [cl tl] {y t )] 

„i/2 + 2n + ° [n h 



By (3.12). multiplying the preceding expression by Px t -i,ji summing over j, and taking 
logarithms yield 



log 



1 | y^ Ej (i/t^t-Lj 
^ 2n 



+ o(n- x ) 



%[4 t _ 1 ,j(^)]Px t _ 1 ,j 



2tj 



3. .5 [■ 



. 5 



Figure 3. The 0.5, 0.95 and 0.99 confidence regions with parameters Ao = 0.251, Ai = 2.0, 
poi = 0.197, pio = 0.61 using importance sampling with n = 100 and replication B — 200. 
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In view of (3.14), summing the preceding expression over t, we find that the log term in 
(A.l) is approximated by 



t=i j=i 



2// 



When n is large, the proportion of time that the chain {xj} spends in state i is ap- 
proximately 7Tj, the stationary probability of state i. Using this fact in the preceding 
expression, we see that it is approximated by 

d d 

^l^-y^lY^Y^jiyt - A - s* + 5 s )(yi - A- 6t + Sj^punViiZs 1 ' 2 ^- 
»=i j=i 

The summation over i,j of the terms in the square bracket equals 

d d d 

J2 E ](Vt - N)(vt - Mi) T 7Tj +^2^2i E j(yt - A - & + 5j)\[^{y t -p,-Si + 5 J )} rT p lJ TT l . 
j=i j=i j=i 

The first term of the preceding expression can be rewritten as 

d 

^Ej[(j/ t - A)(y* - A) T - - A)(Mj - A) T ]^-, 

3=1 



and adding the second term shows that the sum is identical to the variance given by (4.4). 

2^L^L - J' 



By (3.1), we conclude that the log term in (A.l) equals ^E^Sl = \o~\ asymptotically. 



Consider the last term in (A.l). By (3.14), we have 

E Cxt-j^AVt) _ >^ T 1/2 jT yt - A ~ <W-i + &JH _ y.Ty-1/2 T T S " — Up, — 5 Xl + 



t=i 



Note that the last two terms in the numerator are negligible after dividing by n 1 / 2 . By 
(4.3), the preceding expression converges, in distribution, to a normal random variable 
with mean zero and variance a\. This shows that (3.5) is satisfied, which completes the 
proof. 



A. 2. Asymptotic variance of Markov random walks 

Let {X n ,n > 0} be an aperiodic and irreducible Markov chain on a general state space D 
with cr-algebra T>. The irreducibility is with respect to a maximal irreducibility measure 
ip on T>; see Meyn and Tweedie ([22], page 89) for definition. Suppose that an additive 
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component S n = X)fc=o ^ with Sq = Yq = 0, taking values in R f , is adjoined to the chain 
such that {(X n , S n ), n > 0} is a Markov chain on D x M. e and 

P{(X n+1 ,S n+1 ) eAx(B + s)\(X n ,S n ) = (x, a)} 

= P{(X u S 1 )GAx B\(X , So) = {x, 0)} = P(x, Ax B) 

for all x E D, s6l ! , A E V and B E B(E. e ), the Borel a-algebra of R e . The chain 
{(X n , S n ),n > 0} is referred to as a Markov additive process, and its additive component 
S n as a Markov random walk. 

Let v be an initial distribution on Xq and let E„ and vaiv denote expectation and 
variance under v, respectively. If v is degenerate at x, we simply write E^var^). If 
{X n , n > 0} has a unique stationary measure tt, let /i := Jj-, E.j;(£i)7r(dx) denote the sta- 
tionary mean. 

For any real-valued non-negative kernel {K(x, A);x E D,Ae T>}, function h : D — > M, 
and measure ^ on (D,T>), write 

Kh{x)= [ K(x,dy)h(y), ^K{A)= [ *(dx)K(x, A), 
Jd Jd 

^h(A) = / ^(dx)h(x) (a sign measure), (A. 2) 

J A 

^h = ^>h(D) (a real number). 



Condition Kl Minorization. There exist a probability measure ^> on D x R and a 
measurable function h on D such that J h(x)ir(dx) > 0, J ty(dx x 18L e )h(x) > 0, and 

P(x,Ax B)>h(x)^(Ax B), (A.3) 

for all xED : AeV, Be B{M. 1 ). 



For an aperiodic and irreducible Markov random walk satisfying Condition Kl, by 
making use of a splitting chain argument, there exists an equivalent Markov chain with 
a recurrent state; see, for example, Meyn and Tweedie ([22], Chapter 5). Thus, without 
loss of generality, we assume that there exists a recurrent state A in D such that the 
Markov chain X n visits the state A infinitely often. Let Ta = inf{n > 1 : X n = A} be the 
first recurrent time. 



Theorem 2. Let {{X n , S n ), n > 0} be a Markov random walk on state space D , satisfying 
Condition Kl. Assume 



Ea ( |l l Y kf\ <°o and E A T|<oo. 



(A.4) 
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Then 

oo 

E := E^Yi - fi)(yi - fif] +2^E 7r [(F 1 - fx){Y k+1 - /i) T ] (A.5) 

fc=i 

is well defined. Furthermore, if £ is positive definite, then 

— =(£„ — n/x) — ► iV(0, S) m distribution. (A. 6) 
TTie asymptotic variance, S= [of;/]u'=i....,f > can &e calculated via 
[ [G w -T l T v ]ir(dx)+ f \r,{a/)-S x i + S^t]\ri,(a/)-S xV + S^]P(x,^Mdx), (A.7) 

lu/iere r;(a;) = E x (Yi; - m),Gw(x) = E x (Yu - /Ui)(Yu' - /^/) a«d <5 X ; is a measurable 
function from D to M /or eac/i Z = 1, . . . ,£ satisfying the Poisson equation 

(I-P)S xl =PT l (x), (A.8) 

where I denotes the identity kernel and the operators in (A.8) are defined according to 
(A.2). 

Proof. By the regeneration method of Markov random walks developed in Ney and 
Nummclin [23], and following a proof similar to Theorem 17.3.6 in Meyn and Tweedie 
[22], we have the central limit theorem (A. 6). To derive (A.7), we need to show (a) the 
existence of a finite- valued solution 6 to the Poisson equation (A.8); (b) the uniqueness 
of S y i — 5 x i for all x,y £ D and I = 1, . . . ,£; and (c) the validity of the variance formula 
(A.7). 

Let A^ A = inf{n > : X n = A}, and 5 xl = E x (£f= uj where ui{x) = PE x Y ol - 

III = E x Yi; — Hi- Under the assumption (A. 4), S x i is well defined, and we have PS x i = 
MT, k =i MX k ))I {x ^A} +E A Elti MX k ) = E x (Ef=i ui(X k ))I {x7iA} . Therefore, for all 
x£D and l = l,...,£, P8 xl = V x (E k = M*k)) ~ «i(a?) = *xi - PTi(x), so that the Pois- 
son equation is satisfied, which establishes (a). 

The proof of (b) follows from Proposition 17.4. 1 of Meyn and Tweedie [22] . We now un- 
dertake the proof of (c). By Theorem 17.4.2 of Meyn and Tweedie [22], J D YfkLi ?i{x') x 
P (x,dx') is finite. Therefore, by a simple generalization of Theorem 3.3 of Billingslcy 
[6] , we have 

oo 

of,, =E w (y u - w )(^'-W') + 2^E w [(y li - w )(y ( fe+i)r-W')] 

= / G w (x)7r(dx)+ [ T t (x)~6 xl ,7r(dx)+ [ r l (i / )5 a / { /7r(dx / ), (A.9) 
JD Jd jd 
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where 

r oo 

(A.10) 



hi= I r l (x')f]p k (x,dx'). 
Jd k =i 



Next, we show that 8 x i satisfies the Poisson equation (A. 8). That is, for all x £ D and 
1 = 1,. ..J, 

{I-P)8 xl = / r z (x')VP fe (x,dx')- / / T l (x / )P(x,dz)y j P k (z,dx') 
Jd k=1 JdJd fe=1 

p oo „ oo 

= / r l (x')Y / P k (x,dx')- / Ti(x')Y^P^ +1 \x,dx') 
Jd k=i Jd k=i 

Ti(x')P(x,dx')=PT l {x). 

Write Si = 8.i\ then 8i and Ti are measurable functions on D. Let PSi be the function 
and 7rT;rV, irSi and irSiSi> be measures defined according to (A. 2), and write TiSi := 
Yi(x)S x i. Assuming that Si and P; satisfy (A. 8), then we have 

[Ti{x')-S x i+6 x ,i][r v (x') - S xV + 5 x , v ]P(x,Ax')-k{Ax) 

d Jd 

[ (Ti (x')r> {x 1 ) - Ti {x')8 xV - T v {x')5 x i + T t {x')8 x , v + Y v {x')8 xn 
> Jd 

+ $xiSxi> + Sx'iSx'v - SxiSx'v - Sx>iSxi>)n(dx)P{x, dx') 
= vPTiTv - ttSvPTi - >k5iPT v + nPT t 6 v 

+ nPT v 8i + TrSi8 v P + nP5i8 v - -k8iP8 v - nSyPSi 
= ttTiTi, - ttS v (I -P)8i- ttSi (I - P)8 V 

+ nTi8i> + ttTii8i + 2tt8iSi> - n8iPSi> - it8vP8i 
= ttTiTi, + TrTiSi> + ttTi'Si 

r l (x)r l ,(x)n(dx)+ [ r l (x)6 wP n(dx)+ [ Ti^S^midx'). (A.ll) 



By (b), S x >i — S x i is uniquely determined; then we can replace 8 x i in (A.ll) with 8 x i of 
(A. 10). Adding the first term of (A. 7) to (A.ll) and making use of (A. 9) establish the 
asymptotic variance formula (A. 7). □ 

A. 3. LAN family for Markov random walks 

The Markov chain discussed here is assumed to reside on a general state space D, whereas 
the application in (3.4) requires only a finite state space. Let x G D, $ = (#i, . . . , -di) € K f , 
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and let g be a bounded measurable function D. Define the linear operators P$ and P by 

(PtfffXz) = E x {e 9 ^g(X 1 )}, (Pg)(x) = E x {g(Xx)}, (A.12) 

where "•" denotes the inner product. We assume that E v e^' Yl < oo for all i? G 8 C M. e , 
where is an open set containing 0. 

Under Condition Kl, Theorem 4.1 of Ney and Nummelin [23] shows that P# has 
a simple maximal eigenvalue A($) with associated right eigenfunction r (•;'&). Further- 
more, there exists a set F C D with maximum irreducibility measure tp(F c ) = such 
that A(i?) = logA($) is analytic and strictly convex on 0, and r (•;'&) is uniformly pos- 
itive, bounded and analytic on for each x G F. Now, for -d G 0, define the 'twisting' 
transformation for the transition probability of {X n ,n > 0}, 

P4x,dx') = r -¥±^ e - A W^P(x,dx'). (A.13) 
r[x\ v) 

If the function A($) is normalized so that A(0) = dA/di?|^ = o = 0, then Pq = P is the 
transition probability of the Markov chain {X n ,n > 0} with invariant probability measure 

7T. 

For all i? € 0, let a(t) be a probability distribution on the set of non- negative integers, 
and let K$ = J^'tLo a (t)Pj > wnere Pj denotes the i-step transition of P^. A set E C D is 
called v a -petite if there exists a non-trivial measure f a on T> such that JQ(a;,j4) > v a (A) 
for all x G -E and AeV. 

Condition K2 Drift. There exists a function w: D — > [l,oo), a petite set E <E V, a 
constant b < oo, and an extended real-valued function V : D [0, oo] such that, for all 
x G D, P$V{x) < V (x) — w(x) + bl e(x) , where P$V(x) = J V(x')Ps(x, dx') and I denotes 
the indicator function. 

The following lemma characterizes the constants in (A.13) via a Poisson equation. 

Lemma 2. Assume that Conditions Kl and K2 hold for the Markov chain {X n ,n > 0} 
with corresponding V , w and b, such that J_ V(x)ir(dx) < oo. Assume that E^' Yl < oo 
for all $ G 0. Let fi = E OT ii. Then the partial derivatives of r(-;i9) with respect to 
dr(x; 'ff)/d'dk\®=o, for k = 1, . . are bounded on F C D , and are the solutions of the 
Poisson equation 

(I-P)g = P(E x Y 1 - f x), 
where I is the identity operator and P is the operator defined in (A.12). 

The proof of Lemma 2 can be found in Fuh and Hu ([11], Theorem 3). 
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Theorem 3. Under the assumptions of Lemma 2, let = r\l\fn and define the transi- 
tion probability QW = P{f n through (A. 13). Then Q^ L is LAN. In particular, 



lim 



-oo dP 



ex P( z ~ i^^V ), 



where Z is a normal random variable with mean zero and variance rf^'Sr]. 

Proof. By Theorem 4.1 of Ney and Nummelin [23], A(-) and r(x, •) are analytic on 
for each x G F C D. A straightforward Taylor expansion gives 



A 



= A(0) + -5=A'(0) + ^»? T A"(0)7/ + O 
\ n In 



V 

\ n 



1 

2n 



rj S77 + o 



logr( x, — = ) =logr(a;,0) 



r'(x, 0) 77 



1 



o 4= =r'(x,0)^= + o 4= 



r(x,0) \fn 

Applying the two preceding expansions to (A. 13), we obtain 



dP 



exp 



exp 



S n — nA 



V 



+ logr X n ,^= -logr X ,-^= 



V 



[S n -nn + r'{X n ,0) - r'(X , 0)] - hfHi] + o p (-^=\ 



(A.14) 



The first term in the exponent of (A.14) converges in distribution to a normal random 
variable with mean zero and variance rf r 'Er]. The proof is completed. □ 

To apply Theorem 3 to (3.4), we only need to check that / fj{y,9)e 8 y v(dy) < 00 in an 
open set of M. e , as Markov chains in Sections 2 and 3 arc of finite state so Conditions Kl 
and K2 obviously hold. 
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